Session 3 covered:
Writing functions
Function scopes
Installing packages
Using packages
R markdown
Real life example: DESeq2
# Functions
multiply = function(x, y) return(x * y)
multiply(3, 2)
# Default arguments
multiply = function(x, y=2) return(x * y)
z = multiply(3)
# Scopes
multiply = function(x, y) z <<- x * y
multiply(3, 2)
# Passing through arguments
apply(m, 1, multiply, y=2)
What did you learn by trying out your own analysis?
Do we need to recap anything?
Session 4
What makes a good figure?
Colour
viridis colour library
Plotting with ggplot2
Aesthetics
Geometries
Themes
Facets
Saving
Humans are visual creatures!
Visualisation of large data sets is an essential task in molecular biology and medicine. When done effectively, images can help you to explain the most complex of data.
Without an image or summarisation, how well would you do finding significant genes passing a given fold change threshold in a table like this?
##                    baseMean log2FoldChange      lfcSE       pvalue         padj
## PENG0000000001    1.1941808    -0.11077611 0.46558089 1.822729e-01 1.000000e+00
## PENG0000000002    0.1394510    -0.01310322 0.42899817 8.004145e-01 1.000000e+00
## PENG0000000004    0.7393672    -0.01877427 0.39858386 7.883606e-01 1.000000e+00
## PENG0000000006   44.3468742     0.08682659 0.19961999 4.041232e-01 6.352165e-01
## PENG0000000007  788.9630161     1.65624481 0.20412453 2.062683e-18 1.994904e-16
## PENG0000000009 1345.8691348    -0.15601837 0.13576046 9.532748e-02 2.327564e-01
## PENG0000000010    4.4468600     0.04479898 0.29043812 6.594170e-01 1.000000e+00
## PENG0000000012  671.1614707    -0.14329193 0.10518787 7.279070e-02 1.885944e-01
## PENG0000000014    0.1394510    -0.01310322 0.42899817 8.004145e-01 1.000000e+00
## PENG0000000015  770.7712490     0.70623622 0.15143560 1.487255e-07 2.841074e-06
## PENG0000000016    5.3302466     0.14048684 0.33883106 2.315969e-01 1.000000e+00
## PENG0000000022 1276.9182728    -0.06159326 0.07899988 3.184644e-01 5.180787e-01
There are a few key points you need to consider when deciding upon a method of visualisation:
relevance: the message a figure needs to convey
salience: how easily the eye can distinguish your message from the background
accuracy: how exactly different visualisation methods may convey your message
Especially when presenting (when the audience is listening and reading), it’s vital that salience and relevance are aligned.
Human vision is highly selective. We understand visual information by selecting, in turn, individual objects or aspects for detailed analysis rather than by appreciating an entire scene.
Formally, salience is the property of an object that sets it apart from its surroundings; it’s a relative property, therefore, and depends on the collection of objects being visualised.
We can enhance salience by manipulating color, shape, size, and position to focus attention.
It’s not inherently difficult to make information salient …
… what’s important is that the salient points of a figure align with what’s relevant.
Nevertheless, it’s also easy to reduce salience by:
displaying too much information together
attempting to convey different points of relevance within the same image
referring to points of relevance that are tangential to what’s salient
When displaying a graphic, we want the viewer to be able to perceive the patterns and trends that convey the relevant point.
However, humans interpret certain visual cues better than others, and interpretation is subjective: everyone’s different! Let’s try to rank the following methods of conveying the same information:
Research in the field of visual theory has shown that, in order, people are best able to understand:
positions on a common aligned scale
positions on unaligned scales that are otherwise common
lengths
angles and slopes
area
volume and colour saturation
colour hue
Unfortunately, although it’s often a go-to method, colour is amongst the least reliable methods of conveying information!
This is worsened by the fact that colour perception is relative. We distinguish colours differently depending on their surroundings and can be easily tricked into seeing the same colour differently or into seeing different colours similarly.
Not only can we be fooled into perceiving colours contextually, but people also differ in their ability to distinguish colours based on their genetics.
Colour blindness (colour vision deficiency or CVD) is a sliding scale and there are a number of different types. ‘Full’ phenotypes for the three most common forms of CVD are shown here.
Across populations with Northern European ancestry, up to 1/12 males and 1/200 females have some level of red-green CVD. UK-wide, 4.5% of the population have some level of CVD and, even by the time they leave school, approximately 40% are unaware.
When producing figures that make use of colour, if we want them to be salient, it’s important to be inclusive! Thankfully, a number of colour scales have been developed that allow figures to retain salience for those with CVD.
The viridis library provides color maps for use in
R that are:
colourful, spanning as wide a palette as possible
perceptually uniform, such that, across the whole range, nearby values have similar-appearing colors and distant values appear more distinct
friendly for those with colour vision deficiency
viridis can be installed from CRAN.
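As a minimal sketch of what that looks like in practice (assuming the viridis package is already installed; the install line is commented out so that the snippet has no side effects), the viridis() function returns a vector of hex colours that can be handed to any plotting function:

```r
# install.packages("viridis")  # one-off install from CRAN
library(viridis)

# Five colours from the default "viridis" palette, as hex strings
pal = viridis(5)
print(pal)

# Other palettes are selected with option=, e.g. "magma" or "cividis"
pal_magma = viridis(5, option="magma")
print(pal_magma)
```

The variable names pal and pal_magma are just illustrative.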
Although base R can produce a variety of plots, getting
them to be ‘just so’ can be extremely difficult!
These days, virtually all ‘pretty’ plotting is done with the
ggplot2 library.
Given its popularity, lots of libraries interface with
ggplot2 (e.g. viridis). To complement
ggplot2, we’ll also install patchwork, which
helps to lay out plots in rows and grids.
The central ggplot() function provides a consistent
interface to map data to the aesthetics of
geometries.
In other words:
the tabular data that we wish to plot …
… has various facets, which we map to the aesthetic properties we can perceive (position, colour, shape, size, and transparency) …
… of various geometric methods of displaying the data (bars, points, lines, etc)
Most commonly, we make a plot by passing the ggplot()
function two arguments:
data, which takes a data frame (or something that
can be coerced to one)
mapping, which uses the aes() function
to define the aesthetic mappings we’d like to use
penguins = na.omit(read.csv("data/1_palmerpenguins.csv"))
penguins$species = factor(penguins$species)
penguins$island = factor(penguins$island)
penguins$sex = factor(penguins$sex)
penguins$year = factor(penguins$year, ordered=TRUE)
## [1] "gg" "ggplot"
Here, we’ve instantiated a ggplot object with the
penguins data frame and defined two simple aesthetics -
that the x axis should display bill_length_mm
and that y should display bill_depth_mm.
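A sketch of the kind of call being described. So that it runs on its own, it uses penguins_toy, a hypothetical stand-in with invented values and the same column names as the course data, rather than the real penguins data frame:

```r
library(ggplot2)

# Hypothetical stand-in for the course's penguins data (values are invented)
set.seed(1)
penguins_toy = data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  bill_length_mm = rnorm(30, mean=rep(c(39, 49, 47), each=10), sd=2),
  bill_depth_mm = rnorm(30, mean=rep(c(18, 18, 15), each=10), sd=1)
)

# data= takes the data frame; mapping= takes the aesthetics built with aes()
g = ggplot(data=penguins_toy, mapping=aes(x=bill_length_mm, y=bill_depth_mm))

# The result is a ggplot object; no geometries have been added yet
class(g)
```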
Let’s have a look at what we’ve made.
Beautiful, we’re finished!
Remember that a ggplot requires data, aesthetics, and
geometries. So far, we haven’t added any geometries to our plot. We can
think of a ggplot as a painting; ggplot()
provides the canvas and the geometries provide the layers of paint.
Geometries (geom_...() functions) added to a
ggplot2 display data according to the aesthetics defined in
the ggplot object. Geometries are added (literally, as we
use the + operator) to the ggplot and we
reassign the result to the original object variable name.
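As a rough sketch of that pattern (again using penguins_toy, a hypothetical stand-in with invented values, so that the snippet is self-contained):

```r
library(ggplot2)

# Hypothetical stand-in for the course's penguins data (values are invented)
set.seed(1)
penguins_toy = data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  bill_length_mm = rnorm(30, mean=rep(c(39, 49, 47), each=10), sd=2),
  bill_depth_mm = rnorm(30, mean=rep(c(18, 18, 15), each=10), sd=1)
)

# The canvas: data and aesthetics only
g = ggplot(penguins_toy, aes(x=bill_length_mm, y=bill_depth_mm))

# Add a layer of points with + and reassign to the same variable name
g = g + geom_point()

# Typing g at the console (or calling print(g)) draws the plot
```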
Now we can see something!
What we have so far is a little simplistic! Let’s re-work the
aesthetics by getting the ggplot object to colour according
to the levels of the species factor (color and
col are also valid if you’re not British).
The aesthetic properties defined in the ggplot()
function call are set as the defaults for all geometries, provided
they’re applicable. All geometries added will be passed these
defaults.
Here, the linear model (method="lm") trend lines we add
using geom_smooth() inherit their colour
aesthetic from the ggplot parent object.
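A minimal sketch of this inheritance, assuming a hypothetical penguins_toy data frame (invented values) in place of the course data. layer_data() is used only to count the groups that the smoothing layer ends up with:

```r
library(ggplot2)

# Hypothetical stand-in for the course's penguins data (values are invented)
set.seed(1)
penguins_toy = data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  bill_length_mm = rnorm(30, mean=rep(c(39, 49, 47), each=10), sd=2),
  bill_depth_mm = rnorm(30, mean=rep(c(18, 18, 15), each=10), sd=1)
)

# colour=species is set once, in ggplot(), and so applies to every layer
g = ggplot(penguins_toy, aes(x=bill_length_mm, y=bill_depth_mm, colour=species)) +
  geom_point() +
  geom_smooth(method="lm")

# The smoothing layer (layer 2) is split into one group per species
n_lines = length(unique(layer_data(g, 2)$group))
n_lines
```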
Let’s make our first ggplot!
Set up a ggplot object displaying
flipper_length_mm against body_mass_g
Add a colour aesthetic to the plot
Add a geom_smooth(method="lm") geometry to the
plot
What happens if we exchange geom_point() for
geom_density2d()?
The more facets of the data we map to aesthetics, the more it’s partitioned.
g = ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, colour=species, shape=sex))
g +
geom_point() +
geom_smooth(method="lm")
Here, the shape=sex aesthetic further subdivided the
data by sex, giving 6 trend lines instead of 3.
We therefore don’t always want our geometries to inherit all of the defaults
passed to the ggplot() function. Thankfully:
individual geometries can have alternate aesthetics passed using
the mapping= argument, which allows individual aesthetic
parameters to be reset to NULL.
geom_point(mapping=aes(shape=NULL))
aesthetics can be manually specified for all data by passing the aesthetic parameter as an argument itself
geom_point(colour="black")
Here, we pass alpha=0.5 to geom_point() to
manually set the alpha (transparency) of the points. Additionally, we
override the shape aesthetic within
geom_smooth() by setting it to NULL so that we
don’t duplicate our trend lines.
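The two overrides just described might look something like this. It's a sketch on a hypothetical penguins_toy data frame (invented values), with a second plot that omits the override purely for comparison:

```r
library(ggplot2)

# Hypothetical stand-in for the course's penguins data (values are invented)
set.seed(1)
penguins_toy = data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  sex = factor(rep(c("female", "male"), times=15)),
  bill_length_mm = rnorm(30, mean=rep(c(39, 49, 47), each=10), sd=2),
  bill_depth_mm = rnorm(30, mean=rep(c(18, 18, 15), each=10), sd=1)
)

base = ggplot(penguins_toy,
              aes(x=bill_length_mm, y=bill_depth_mm, colour=species, shape=sex))

# alpha is set manually for all points; shape is unset for the trend lines only
g = base +
  geom_point(alpha=0.5) +
  geom_smooth(mapping=aes(shape=NULL), method="lm")

# For comparison: without the override, the trend lines are also split by sex
g_split = base +
  geom_point(alpha=0.5) +
  geom_smooth(method="lm")

n_lines = length(unique(layer_data(g, 2)$group))
n_lines_split = length(unique(layer_data(g_split, 2)$group))
c(overridden=n_lines, inherited=n_lines_split)
```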
Some aesthetics are only available for specific geometries and, depending on the geometry, certain aesthetic options are more relevant or salient than others.
As a general guide, discrete variables are best visually separated with …
shape (for points) for very small numbers of
discrete groups and where overlapping is minimal
linetype (for lines) for very small numbers of
discrete groups
colour (or fill for filled geometries)
using separate hues
… whereas continuous variables are most accurately displayed using:
size or linewidth
a scaled gradient across distant colour
hues
a scaled gradient across a single colour
alpha, if necessary
Separately to the aesthetics of the geometries, we can control other
visual aspects of the plot by modifying the theme()
defaults or by using one of the theme_...() presets. We can
also use the labs() function to control the axis and plot
titles.
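A small sketch of how these pieces combine, assuming a hypothetical penguins_toy data frame (invented values); the title and axis text are made up for illustration:

```r
library(ggplot2)

# Hypothetical stand-in for the course's penguins data (values are invented)
set.seed(1)
penguins_toy = data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  bill_length_mm = rnorm(30, mean=rep(c(39, 49, 47), each=10), sd=2),
  bill_depth_mm = rnorm(30, mean=rep(c(18, 18, 15), each=10), sd=1)
)

g = ggplot(penguins_toy, aes(x=bill_length_mm, y=bill_depth_mm, colour=species)) +
  geom_point() +
  # labs() sets the plot title, the axis titles, and the legend title
  labs(title="Bill dimensions by species",
       x="Bill length (mm)", y="Bill depth (mm)", colour="Species") +
  # A preset replaces the whole theme ...
  theme_bw() +
  # ... so individual overrides need to come after it
  theme(legend.position="bottom")
```

The order matters here: adding a theme_...() preset after theme() would discard the override.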
Let’s add some style to our ggplot.
Update your previous plot to use a theme preset. Try out
theme_bw(), theme_classic(), and
theme_minimal().
Even when using a theme preset, we can still override specific
elements using theme(). What does adding
theme(axis.text.y=element_text(angle=90, vjust=0.5, hjust=0.5))
achieve?
How might we rotate the labels for the x axis?
Looking at the help for element_text(), how might we
change the size of the plot.title element?
There are many geometries available to help display data of different formats. The majority of graphing applications fall into five groups:
single continuous variable: geom_freqpoly(),
geom_histogram(), geom_area(),
geom_density()
single discrete variable: geom_bar()
two continuous variables: geom_point(),
geom_smooth(), geom_rug(),
geom_density_2d()
two variables, one continuous and one discrete:
geom_boxplot(), geom_violin(),
geom_dotplot(), geom_jitter(),
geom_col()
two discrete variables: geom_count(),
geom_jitter()
Let’s see a few options for plotting a continuous against a discrete variable.
Different geometries can be used to highlight - make salient -
different aspects of the data. Here, geom_boxplot() better
shows the position of the median value, whereas
geom_violin() better highlights the spread of the data and
geom_jitter() might do that too much!
Let’s see a few options for plotting two continuous variables against each other.
Here, as an alternative to geom_point(),
geom_density2d() may better highlight the ‘centre of mass’
for each group. geom_rug() might be suitable alongside
another geometry but otherwise lacks both accuracy and saliency.
Using this base …
… let’s compare a couple of options for plotting a single continuous variable.
Add a geom_histogram() to the plot
Switch that for a geom_freqpoly()
Which plot has better accuracy and saliency?
Let’s apply some of our knowledge about using colour effectively … and inclusively!
The viridis library integrates easily with
ggplot2 using various
scale_colour_viridis_...() and
scale_fill_viridis_...() functions:
scale_colour_viridis_d() makes salient mappings of
discrete variables
scale_colour_viridis_c() makes accurately
distinguishable mappings for continuous variables
scale_colour_viridis_b() merges the two - continuous
variables are binned to enhance salience
Let’s start by replacing ggplot2’s default colour scheme
for discrete variables using the scale_colour_viridis_d()
function.
ggplot(penguins, aes(x=body_mass_g, y=flipper_length_mm, colour=species)) +
geom_point() +
theme_bw() +
scale_colour_viridis_d()
Pretty easy!
As we saw earlier, viridis has a variety of palettes.
Let’s compare a few on some discrete data by passing the
option= argument to the
scale_colour_viridis_d() function.
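One way such a comparison could be set up is sketched below, on a hypothetical penguins_toy data frame (invented values). The two palettes chosen, "magma" and "cividis", are just examples of values that option= accepts:

```r
library(ggplot2)

# Hypothetical stand-in for the course's penguins data (values are invented)
set.seed(1)
penguins_toy = data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  body_mass_g = rnorm(30, mean=rep(c(3700, 3700, 5000), each=10), sd=300),
  flipper_length_mm = rnorm(30, mean=rep(c(190, 196, 217), each=10), sd=6)
)

base = ggplot(penguins_toy, aes(x=body_mass_g, y=flipper_length_mm, colour=species)) +
  geom_point() +
  theme_bw()

# Same plot, two different viridis palettes
g_magma = base + scale_colour_viridis_d(option="magma")
g_cividis = base + scale_colour_viridis_d(option="cividis")

# The hex colours actually assigned to the three species
cols_magma = unique(layer_data(g_magma, 1)$colour)
cols_cividis = unique(layer_data(g_cividis, 1)$colour)
```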
Using this base …
What are we doing in the colour aesthetic?
Make a graph making use of
scale_colour_viridis_c()
Try a few different palettes by setting
option=
Switch that out for
scale_colour_viridis_b()
Within this nonsense example, which method has:
better accuracy?
better saliency?
Where we need to highlight two separate points within a plot, it’s common to incorporate the second factor using a separate aesthetic (e.g. as a symbol if we’re primarily using colour). However, mixing or increasing the complexity of aesthetics can lead to poor saliency!
g = ggplot(na.omit(penguins), aes(x=bill_length_mm, y=bill_depth_mm, colour=species)) +
scale_colour_viridis_d(option="turbo") +
theme_bw()
g + geom_point(aes(shape=sex))
Often, a better alternative is to facet the data by the second factor.
facet_wrap() will use a factor (or factors) specified
using formula notation to split the dataset into separate, grouped,
plots.
Alternatively, for more complex layouts, we can use
facet_grid() to produce a 2D grid.
For facet_grid(), the row and column variables are typically supplied using
ggplot2’s vars() function, which quotes the factors you’re faceting
with (formula notation, such as rows ~ cols, is also accepted).
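Both faceting functions can be sketched on a hypothetical penguins_toy data frame (invented values) as follows; layer_data() is used here only to count the panels that receive data:

```r
library(ggplot2)

# Hypothetical stand-in for the course's penguins data (values are invented)
set.seed(1)
penguins_toy = data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  sex = factor(rep(c("female", "male"), times=15)),
  bill_length_mm = rnorm(30, mean=rep(c(39, 49, 47), each=10), sd=2),
  bill_depth_mm = rnorm(30, mean=rep(c(18, 18, 15), each=10), sd=1)
)

base = ggplot(penguins_toy, aes(x=bill_length_mm, y=bill_depth_mm, colour=species)) +
  geom_point()

# One panel per level of sex, using formula notation
g_wrap = base + facet_wrap(~ sex)

# A 2D grid: sex down the rows, species across the columns
g_grid = base + facet_grid(rows=vars(sex), cols=vars(species))

n_wrap = length(unique(layer_data(g_wrap, 1)$PANEL))
n_grid = length(unique(layer_data(g_grid, 1)$PANEL))
c(wrap=n_wrap, grid=n_grid)
```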
Using this base …
Add a geom_density() to display the
distribution of body_mass_g for each
species
Let’s see if there’s variation in their weight by year.
Add a facet_grid() that separates by year
and sex
There are two ways of adjusting the axis scales in
ggplot2:
setting everything outside of a specified window to
NA. This is achieved by setting limits= within
the relevant scale_...() function
scale_x_continuous(limits=c(1, 100))
zooming into a region without otherwise manipulating the data.
This is achieved by setting xlim= and ylim=
using the coord_cartesian() function
coord_cartesian(xlim=c(1, 100))
These differ in whether the non-visible points are used to contribute information to the plot:
limits prevents this. A geom_smooth()
geometry will create a fit using only the points displayed
coord_cartesian() enables this. A
geom_smooth() geometry will create a fit using all data,
even if it’s not displayed.
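The difference is easiest to see on data where the trend changes outside the visible window. A minimal sketch with invented data follows; tent and fit_slope() are names made up for this illustration and aren't part of ggplot2:

```r
library(ggplot2)

# Invented data: y rises until x = 50 and then falls again
tent = data.frame(x=1:100)
tent$y = ifelse(tent$x <= 50, tent$x, 100 - tent$x)

base = ggplot(tent, aes(x=x, y=y)) +
  geom_smooth(method="lm", se=FALSE)

# Scale limits: points with x > 50 become NA and are dropped (with a warning)
# before the model is fitted
g_limits = base + scale_x_continuous(limits=c(1, 50))

# Coordinate zoom: the same window is shown, but all 100 points are fitted
g_zoom = base + coord_cartesian(xlim=c(1, 50))

# Hypothetical helper: recover the slope of the straight line that was drawn
fit_slope = function(g) {
  fitted_line = layer_data(g, 1)
  unname(coef(lm(y ~ x, data=fitted_line))["x"])
}

slope_limits = fit_slope(g_limits)  # close to 1: only the rising half was used
slope_zoom = fit_slope(g_zoom)      # close to 0: rising and falling halves cancel
```

Both plots show the same x range, yet the trend lines differ because of which points reached geom_smooth().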
Alongside the adjustment of the scales, it’s often useful to be able to add reference or threshold lines to a plot to enhance saliency.
ggplot2 has three geometries that allow this:
geom_hline(), taking a yintercept
argument
geom_vline(), taking an xintercept
argument
geom_abline(), taking slope and
intercept arguments
g =
ggplot(na.omit(penguins), aes(x=species, y=body_mass_g, colour=species)) +
geom_violin(fill=NA) +
geom_hline(yintercept=3000, linetype="dashed") +
theme_bw()
g
It’s also possible to add text annotations to a plot using
annotate() elements, using the x and
y arguments to position the text.
For the purposes of display (and for positioning labels), categorical
data, as above, are converted to sequences of integers (1, 2, 3, …) on
the x axis.
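A sketch of such an annotation on a violin plot with a threshold line, using a hypothetical penguins_toy data frame (invented values); the label text and its position are arbitrary choices for illustration:

```r
library(ggplot2)

# Hypothetical stand-in for the course's penguins data (values are invented)
set.seed(1)
penguins_toy = data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  body_mass_g = rnorm(30, mean=rep(c(3700, 3700, 5000), each=10), sd=300)
)

g = ggplot(penguins_toy, aes(x=species, y=body_mass_g, colour=species)) +
  geom_violin(fill=NA) +
  geom_hline(yintercept=3000, linetype="dashed") +
  # x=1 sits over the first species level; y is given in data units
  annotate("text", x=1, y=3100, label="threshold", colour="black") +
  theme_bw()

# The annotation is stored as its own (third) layer
annotation = layer_data(g, 3)
```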
It’s also possible to label specific points within a plot using
geom_text() or geom_text_repel() from the
ggrepel package, which repels labels from each other to
prevent overlap.
Where there are only a few points, this works without further effort.
Where there are lots, it’s more common to need to label a
specific subset. To do this, we supply new data to
the geometry.
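As a rough sketch, assuming a hypothetical penguins_toy data frame (invented values), we might label only the heaviest bird of each species. Plain geom_text() is used to keep the example to ggplot2 alone; geom_text_repel() from ggrepel takes data= and mapping= in the same way:

```r
library(ggplot2)

# Hypothetical stand-in for the course's penguins data (values are invented)
set.seed(1)
penguins_toy = data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  body_mass_g = rnorm(30, mean=rep(c(3700, 3700, 5000), each=10), sd=300),
  flipper_length_mm = rnorm(30, mean=rep(c(190, 196, 217), each=10), sd=6)
)

# The subset to label: the heaviest bird within each species
heaviest = do.call(rbind, lapply(
  split(penguins_toy, penguins_toy$species),
  function(d) d[which.max(d$body_mass_g), ]
))

g = ggplot(penguins_toy, aes(x=flipper_length_mm, y=body_mass_g, colour=species)) +
  geom_point() +
  # Only this layer uses the subset; the points still show every bird
  geom_text(data=heaviest, mapping=aes(label=species), vjust=-1, show.legend=FALSE)
```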
Saving ggplot objects is simpler than for base
R graphics.
ggplot2 has a special ggsave() function
that will adjust width and height in a variety
of units and save in 10 different formats!
Here, we pass a specific ggplot object to save but, by
default, it saves the last displayed plot.
The ggsave() function is also clever enough to guess the
required device (png, svg,
pdf, etc) from the file extension given in the path name;
we can specify this for clarity but it’s not necessary.
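A minimal sketch of saving a plot, using a hypothetical penguins_toy data frame (invented values). It writes to a temporary directory so that nothing is left behind; in practice you'd give your own path, and the file name here is made up:

```r
library(ggplot2)

# Hypothetical stand-in for the course's penguins data (values are invented)
set.seed(1)
penguins_toy = data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each=10)),
  bill_length_mm = rnorm(30, mean=rep(c(39, 49, 47), each=10), sd=2),
  bill_depth_mm = rnorm(30, mean=rep(c(18, 18, 15), each=10), sd=1)
)

g = ggplot(penguins_toy, aes(x=bill_length_mm, y=bill_depth_mm, colour=species)) +
  geom_point() +
  theme_bw()

# The .pdf extension tells ggsave() which device to use
out_file = file.path(tempdir(), "penguins_toy.pdf")
ggsave(out_file, plot=g, width=12, height=8, units="cm")

file.exists(out_file)
```

Swapping the extension for .png or .svg should be enough to change format, although some formats depend on extra packages or system libraries being available.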
There’s some graphing homework to help cement what we’ve covered today!
The homework and instructions can be found within the main directory
for the course: ./homework/Homework_4.Rmd